Bayesian Hierarchical Modeling of
Cell-Type-Specific Gene Regulation
Michael Love,
Assistant Professor,
Dept of Biostatistics
Dept of Genetics
Biological aim: cell-type-specific regulation
In particular, effect of common genetic variation on …
Mouse brain, dentate gyrus. Livet, Weissman, Sanes and Lichtman/Harvard University
Bayesian Hierarchical Models
Let’s demystify — what do we plan to do
- A parametric model for regulation
- Parameters will represent an abstract relationship
- E.g. SNP \(\rightarrow\) expression, or SNP \(\rightarrow\) chromatin accessibility
- A hierarchy to compare these across cell types
- Specify distributions relating parameters to e/o
- Bayes theory and comp. frameworks to make inference
What is a likelihood?
\[ y_i \sim N(\mu, 1) \]

Bayesian model
What is a posterior? Likelihood \(\times\) some other distribution

What is a Bayesian Hierarchical Model?
- Like before, Bayesian model for data \(y_i\)
- But there is some structure in the data
- E.g. data on schools, schools are in grouped into regions
- Multiple \(\mu\) to estimate (say, one per school)
A simple hierarchical model to start
\[
\begin{align}
y_i &\sim N(\mu_i, \sigma^2=1) \\
\mu_i &\sim N(0, \sigma^2=A) \\
\end{align}
\]
\[
\hat{\mu}_i^{Bayes} = \left(1 - \frac{1}{A+1} \right) y_i
\]
“Stein’s Estimation Rule and Its Competitors - An Empirical Bayes Approach.” Bradley Efron and Carl Morris,
Journal of the American Statistical Association, Vol. 68, No. 341 (Mar., 1973), pp. 117-130 (14 pages) DOI: 10.2307/2284155
Let’s plug in some numbers
Let’s try A = 1 (SD = 1) and A = 9 (SD = 3)

What if A is not know?
\[ \hat{\mu}_i^{James-Stein} = \left(1 - \frac{1}{Z} \right) y_i \]
Empirical Bayes = estimate distribution of \(\mu_i\) using \(y_i\)
where \(Z = \frac{1}{n-2} \sum_{i=1}^n y_i^2\)
Roughly: estimating A + 1 = observed variance
This makes sense: \(y_i = \mu_i + \varepsilon_i\)
Let’s try this out
\[
\begin{align}
\hat{\mu}_i^{MLE} &= y_i \\
\hat{\mu}_i^{naive} &= 0
\end{align}
\]

A more complex model
Schools (\(\mu_i\)) clustered in regions (\(\theta_k\)).
\(k(i)\) gives the region for school \(i\).
\[
\begin{align}
y_i &\sim N(\mu_i, 1) \\
\mu_i &\sim N(\theta_{k(i)}, 1) \\
\theta_k &\sim N(0, 5)
\end{align}
\]
- Not so simple to compute, switch to using a tool called Stan
- Many frameworks for performing MCMC where one specifies the data, parameters, and model
- Then “samples” (of parameters) from the posterior
- Stan is very fast.
How does this look in Stan?
\[
\begin{align}
y_i &\sim N(\mu_i, 1) \\
\mu_i &\sim N(\theta_{k(i)}, 1) \\
\theta_k &\sim N(0, 5)
\end{align}
\]
Model output (n=40, k=4)
This model samples very fast in Stan (hundreds of a second). We can look at posterior distributions:
Compared to truth
Look at the school level (\(\mu_i\)):
Did we do better than MLE?
Did we do better than MLE?
Look at the school level (\(\mu_i\)):
When will Bayes \(>\) a simpler per-region estimator?
James-Stein shrinks by \(\approx \left(1 - \frac{V_y}{V_\mu + V_y} \right)\)
When are Bayesian models useful?
- Bayes estimators converge to MLE as \(n \to \infty\), so small \(n\)
- Or, small \(n\), higher variance for some groups
- E.g. some regions having only a few schools
- Or, sharing information on mid-level parameters, e.g. \(V_\mu\)
BHM in genomics
- limma, edgeR, DESeq2, apeglm
- These are empirical Bayes, don’t involve MCMC
BHM in eQTL and GWAS
- eCAVIAR - estimated coefficients ~ multivariate normal
- Zhang, …, Chatterjee, Estimation of complex effect-size distributions using summary-level statistics Nat Gen (2018)


Modeling TF binding across cell types
- Love et al, Role of the chromatin landscape and sequence in determining … binding NAR (2017)
Modeling TF binding across cell types
BHM applied to neuronal assays in Stein lab
- MR Locus
- asks: \(\uparrow\) eQTL, \(\uparrow\) GWAS?
- mixture modeling of all effects, no thresholding
- uncertainty on the effect sizes
- pathQTL
- cell-type-specific mediation across multiple “layers”
- as in the TF binding case, focus on cell-type diffs
- incorporate SNR into the mediation